STAT 331/531: Final Project Analysis

Author

Jenna Chan, Swara Kurakula, Chirs Li, Wesley Tam

Code
library(tidyverse)
library(knitr)
library(broom)
expenditure <- read_csv("data/expenditure_per_student_primary_percent_of_gdp_per_person.csv")
completion <- read_csv("data/primary_completion_rate_total_percent_of_relevant_age_group.csv")

1 Introduction

1.1 Detailed Data and Variable Description.

For our project, we are using two distinct datasets to analyze the relationship between government expenditure on primary education (explantory variable) and primary school completion rates (response variable) across various countries. Our explanatory and response variables were both taken from Gapminder, which got its data from The World Bank, which collects its data from the statistical systems of its member countries (https://data.worldbank.org/about).

Dataset 1: Expenditure per Student, Primary (% of GDP per Person)

This dataset contains annual data on the percentage of GDP per capita that is spent on each primary school student. GDP per capita is the sum of all goods and services produced in a country (GDP) divided by the country’s population (per capita).

The variables included are:

  1. Country: The name of the country.

  2. 1995-2018: 23 variables, 1995-2018, each representing the year of measured expenditure

The data spans a couple decades, allowing us to observe trends and changes in educational expenditure over time. Each cell represents the expenditure per student for that year and country. For instance, in 2000, Argentina allocated 12.3% of its GDP per capita to primary education per student, while by 2013, this figure had varied.

Dataset 2: Primary Completion Rate, Total (% of Relevant Age Group)

This dataset provides annual data on the primary school completion rate, indicating the percentage of primary schoolers that successfully complete primary education.

The variables included are:

  1. Country: The name of the country.

  2. 1970-2022: The year for which the data is recorded.

The data covers a wide range of years, each cell represents completion percentage for students in primary school. For example, in 1980, Afghanistan had a completion rate of 27.7%, which changed significantly over the following decades.

By combining these datasets, we aim to explore the correlation between the financial investment in education and the corresponding completion rates across different countries for the time period of 2006-2016. This analysis will help us understand the impact of educational funding on student outcomes and provide useful insights on this relationship.

Hypothesized relationship between the variables (and any outside references). Discussion of data cleaning process and decisions.

1.2 Relationship Between Variables

Presented as a percentage of the GDP per capita, government expenditure per student is defined as the average government expenditure, spread across three domains (current expenditure, capital, and transfers), per student in the given level of education. It is hypothesized that primary school completion rates across various countries increase as the government expenditure per primary student increases. This hypothesis is based on the reasoning that greater expenditures can lead to better-trained and better-paid teachers, which directly impacts student performance and completion rates. Additionally, investments in school infrastructure create a conducive learning environment, reducing dropout rates and encouraging completion. Further, additional funding can provide students with textbooks, technology, and other learning resources that support their education and reduce barriers to completion. For instance, the National Center for Education Statistics reports that in the United States, between 2019 and 2020, 11% of current expenditures on education were allocated to food contracts, janitorial services, transportation, and professional development for teachers. These expenditures are crucial for maintaining a supportive and effective learning environment. Providing nutritious meals, ensuring clean and safe facilities, and offering reliable transportation all contribute to student well-being and readiness to learn. Professional development for teachers ensures they are well-equipped with the latest teaching strategies and knowledge, directly impacting student performance and completion rates. This example illustrates how targeted educational expenditures can foster an environment conducive to higher primary school completion rates. (https://nces.ed.gov/fastfacts/display.asp?id=66)

1.3 Hypothesized Relationship Between Variables

Presented as a percentage of the GDP per capita, government expenditure per student is defined as the average government expenditure, spread across three domains (current expenditure, capital, and transfers), per student in the given level of education. It is hypothesized that primary school completion rates across various countries increase as the government expenditure per primary student increases. This hypothesis is based on the reasoning that greater expenditures can lead to better-trained and better-paid teachers, which directly impacts student performance and completion rates. Additionally, investments in school infrastructure create a conducive learning environment, reducing dropout rates and encouraging completion. Further, additional funding can provide students with textbooks, technology, and other learning resources that support their education and reduce barriers to completion. For instance, the National Center for Education Statistics reports that in the United States, between 2019 and 2020, 11% of current expenditures on education were allocated to food contracts, janitorial services, transportation, and professional development for teachers. These expenditures are crucial for maintaining a supportive and effective learning environment. Providing nutritious meals, ensuring clean and safe facilities, and offering reliable transportation all contribute to student well-being and readiness to learn. Professional development for teachers ensures they are well-equipped with the latest teaching strategies and knowledge, directly impacting student performance and completion rates. This example illustrates how targeted educational expenditures can foster an environment conducive to higher primary school completion rates. (https://nces.ed.gov/fastfacts/display.asp?id=66)

1.4 Data Cleaning

We chose to limit our analysis to the period from 2006 to 2016, as this range contained the majority of the data with minimal missing values. For each year to be considered acceptable, both expenditure and completion rate needed to be recorded. Any year missing either of these metrics was excluded. Additionally, countries with fewer than 8 acceptable measurements within this timeframe were omitted to ensure that our analysis had a sufficient amount of data for each country.

Our data cleaning process involved transforming the datasets for expenditure per person and primary school completion rates into a long format for the selected years. We then merged these datasets by country and year. And then filtered the data as mentioned earlier.

We also summarized the dataset by calculating the average expenditure and completion rate for each country over the specified period.

Code
# Pivot the data
expenditure_longer <- expenditure |>
  select(country, `2006`:`2016`) |>
  pivot_longer(cols = `2006`:`2016`,
               names_to = "year",
               values_to = "expenditure")
completion_longer <- completion |>
  select(country, `2006`:`2016`) |>
  pivot_longer(cols = `2006`:`2016`,
               names_to = "year",
               values_to = "completion_rate")

# Join the datasets
joined <- expenditure_longer |>
  full_join(completion_longer,
            join_by(country == country, year == year))

# Remove countries with too little data
ACCEPTABLE_THRESHOLD <- 8
joined <- joined |>
  mutate(is_acceptable = !is.na(completion_rate) & !is.na(expenditure)) |>
  group_by(country) |>
  mutate(num_acceptable = sum(is_acceptable)) |>
  filter(num_acceptable >= ACCEPTABLE_THRESHOLD & 
         is_acceptable == TRUE)

# Summarize the Data
avg_joined <- joined |>
  summarize(avg_expenditure = mean(expenditure),
            avg_completion = mean(completion_rate))
head(avg_joined)
# A tibble: 6 × 3
  country      avg_expenditure avg_completion
  <chr>                  <dbl>          <dbl>
1 Argentina               14.1          103. 
2 Belize                  15.8          103. 
3 Bulgaria                21.3          102. 
4 Burkina Faso            21.2           50.4
5 Cape Verde              15.6           91.7
6 Colombia                15.4          112. 

2 Linear Regression

2.1 Data Visualizations

2.1.1 Relationship between the variables

Code
viz1 <- avg_joined |>
  ggplot(aes(x=avg_expenditure, y=avg_completion)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", color = "hotpink", size = 1.5) +
  labs(title = "Completion Rate vs Expenditure Per Person (Primary School Level)",
       x = "Expenditure (% of GDP Per Person)",
       y = "",
       subtitle = "Completion Rate (% of Primary Age Group)",
       caption = "Viz 1") 
viz1

From the scatter plot we see that the data points do not align well with a straight line, suggesting that a linear model may not be appropriate. Instead, the data points appear to follow a curved pattern, indicating a potential non-linear relationship between the two variables. Additionally, we explored various transformations of the variables, including logarithmic and squared transformations. However, these transformations did not improve the fit and, in some cases, performed worse than the original linear model (analyzed through adjusted r-squared).

2.1.2 Relationship between the variables over time (2006 to 2016)

Code
#loading necessary libraries
library(countrycode)
library(gganimate)
library(gifski)

#adding continents variable
joined_conts <- joined |>
  mutate(continent = countrycode(sourcevar = country, origin = "country.name", destination = "continent"))
Code
#dont run this chunk (animated graph) in rstudio - 100 graphs get downloaded
#only run this through rendered html
viz2 <- joined_conts |>
  mutate(year = as.integer(year)) |>
  ggplot(aes(expenditure, completion_rate, color = continent)) +
    geom_point(alpha = 0.8, show.legend = TRUE) +
    labs(title = "Completion Rate vs Expenditure Per Person from 2006 to 2016 
                  (Primary School Level)",
         subtitle = 'Year: {frame_time}
         Completion Rate (% of Primary Age Group)', 
         x = "Expenditure (% of GDP Per Person)", 
         y = "",
         color = "Continent",
         caption = "Viz 2") +
    transition_time(year) +
    ease_aes('linear') +
  theme(plot.subtitle = element_text(size=10, vjust = -2),
        axis.title.x = element_text(size=10),
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt"))
viz2

The points in the graph do not exhibit a discernible pattern. However, this observation aligns with the complexity of our dataset. Each point on the graph represents a country, and each country operates within its unique legal and local context regarding expenditure per person at the primary school level. These diverse circumstances also influence the completion rates of these students.

2.2 Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable (often denoted as Y) and one or more independent variables (often denoted as X1,X2,…Xp). The core idea behind linear regression is to fit a linear equation to observed data that best explains the relationship between the dependent variable and the independent variables.

Linear regression assumes that there is a linear relationship between the dependent and independent variables, the errors (residuals) are normally distributed, the variance of the errors is constant (homoscedasticity) and the errors are independent of each other.

Code
model <- lm(avg_completion ~ avg_expenditure, data=avg_joined)
regression <- tidy(model)
kable(regression, caption = "Government Expenditure Regression Model")
Government Expenditure Regression Model
term estimate std.error statistic p.value
(Intercept) 83.7383373 6.9876281 11.983800 0.0000000
avg_expenditure 0.4661166 0.3787711 1.230603 0.2243464

The estimated regression model is as follows: \[{\hat{avg\_completion}} = 83.74 + 0.466(avg\_expenditure)\]

The intercept represents the estimated value of avg_completion when avg_expenditure is zero. The estimate of 83.7383 suggests that if no money is spent (avg_expenditure = 0), the average completion rate is predicted to be 83.7383 percent.

The coefficient for avg_expenditure represents the change in the average completion rate for a one-unit increase in average expenditure. The estimate of 0.4661 suggests that for each additional percentage point of average expenditure, the average completion rate increases by 0.4661 percentage points.

2.3 Model Fit

Another way to analyze our regression model is to measure its variance. In the context of our regression model, variance measures how much the predicted primary school completion rates vary around the mean of these predicted values. A low variance indicates that the predicted values tend to be close to their mean, implying our model is consistently predicting accurate completion rates. On the other hand, a high variance indicates that the predicted values are spread out over a wider range, suggesting our model’s predictions are less accurate.

Code
aug <- broom::augment(model)

variances <- data.frame(
  Variable = c("Response Values", 
               "Fitted Values", 
               "Residuals"),
  Variance = c(var(aug$avg_completion), 
               var(aug$.fitted), 
               var(aug$.resid)))

kable(variances, caption = "Regression Model Variances")
Regression Model Variances
Variable Variance
Response Values 262.905600
Fitted Values 7.881709
Residuals 255.023891

With this variance table, we can calculate our R-squared value with the formula \[R^2 = \frac{{\text{Fitted Values Variance}}}{{\text{Response Values Variance}}}\]. Using this formula gives us an R-squared value of 0.02998. This means that 2.998% of the variance seen in primary school completion rate can be explained by government expenditure. This means that our linear model is not well-fit to predict primary school completion given government expenditure. This may be because, as seen in the first graph, there does not appear to be a linear correlation between our two variables. As such, a linear regression model would not be able to effectively predict one given the other.

3 Simulation

3.1 Visualizing Simulations from the Model

With our simple linear regression, we can generate predictions using the predict() function. Then, we can add random errors to the predictions, using the residual standard error estimated from the linear regression model (acquired with sigma()).

Code
pred_avg_completion <- predict(model, data = avg_joined)

est_sigma <- sigma(model)

rand_error <- function(x, mean = 0, sd){
 sim <-  x + rnorm(length(x), mean, sd)
 return(sim)
}

set.seed(1234)
sim_response <- tibble(sim_avg_completion = rand_error(pred_avg_completion,
                                            sd = est_sigma))

Below is a scatter plot of the observed average completion versus the simulated average completion.

Code
full_data <- avg_joined |> 
  select(avg_expenditure, avg_completion) |> 
  bind_cols(sim_response)

viz3 <- full_data |> 
  ggplot(aes(x = sim_avg_completion,
             y = avg_completion)) +
  geom_point() +
  geom_abline(slope = 1,
              intercept = 0, 
              color = "orange3",
              linetype = "dashed") +
  theme(aspect.ratio = 1) +
  labs(x = "Simulated Average Completion",
       y = "",
       subtitle = "Observed Average Completion",
       caption = "Viz 3")

viz3

Here is a table of some summary statistics of a linear regression run on observed average completion versus simulated average completion:

Code
fit <- lm(avg_completion ~ sim_avg_completion, data = full_data) |> 
  broom::glance()

fit <- fit |> 
  select("r.squared", "adj.r.squared", "sigma", "statistic", "p.value")

kable(fit, 
      col.names = c("R-squared", 
                    "Adjusted R-squared", 
                    "Sigma", 
                    "F-statistic", 
                    "p-value"), 
      caption = "Summary Statistics of Linear Regression Model",
      format = "markdown")
Summary Statistics of Linear Regression Model
R-squared Adjusted R-squared Sigma F-statistic p-value
0.0169252 -0.0031376 16.23978 0.8436109 0.3628653

3.1.1 Regression Summary Statistics

  • R-squared (0.0169):

    • This low R-squared value indicates that the linear model does not explain much of the variance in the observed data. It suggests that the simulated values do not align well with the observed values.
  • Adjusted R-squared (-0.0031):

    • The negative adjusted R-squared value further confirms that the model does not fit the data well. It indicates that the model’s explanatory power is very low.
  • Residual Standard Error (16.23978):

    • A high residual standard error implies significant deviation of the observed data points from the regression line. This further indicates that the model does not capture the relationship between observed and simulated data well.

The analysis of these values indicates that while the simulated data captures the general trend and variability of the observed data, there are significant differences in terms of scatter distribution, presence of outliers, and central tendency.

3.2 Generating Multiple Predictive Checks

Code
# Generate 1000 simulated datasets
set.seed(123)  # For reproducibility
n_simulations <- 1000


# Function to add random error
rand_error <- function(x, mean = 0, sd) {
  error <- rnorm(length(x), mean = mean, sd = sd)
  x_with_error <- x + error
  return(x_with_error)
}

# Extract the observed response variable
#observed_response <- avg_joined$avg_completion



perform_predictive_check <- function(data) {
  # Fit a regression model to the observed data
  model <- lm(avg_completion ~ avg_expenditure, data = data)
  
  # Obtain predicted response values from the model
  predicted_values <- predict(model, newdata = data)
  
  # Add random errors to the predictions
  simulated_response <- tibble(sim_completion =rand_error(predicted_values, mean = 0, sd = summary(model)$sigma))
  
  full_data <- avg_joined |> 
  select(avg_completion) |> 
  bind_cols(simulated_response)
  # Perform regression of observed against simulated data
  sim_model <- lm(avg_completion ~ sim_completion, data = full_data)
  
  # Calculate R-squared
  r_squared <- summary(sim_model)$r.squared
  
  return(r_squared)
}

r_squared_values <- map_dbl(1:n_simulations, ~ perform_predictive_check(avg_joined))
#print(r_squared_values)
Code
ggplot(data.frame(R2 = r_squared_values), aes(x = R2)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(title = "Distribution of R-squared Values",
       x = "R-squared",
       y = "Frequency")

R-squared is the proportion of the total variation in the dependent variable (Y) that is explained by the independent variable(s) (X) in the regression model. R-squared is between 0 and 1. A large R-squared means the explanatory variable is good at explaining the response. For example, \(R^2\) = 0.7 means that 70% of the variance in the dependent variable is explained by the independent variable(s).

Over 300 simulated datasets have an R-squared value of 0 which shows that they provide no explanatory power for the observed data. This suggests that the regression model, when applied to these simulated datasets, often fails to capture any meaningful relationship between the explanatory and response variables.

In addition, the highest R-squared value does not exceed 0.2 which means that even in the best cases, the model only explains up to 20% of the variance in the observed data. This is a relatively low proportion, indicating that the model has limited explanatory power. The majority of the variability in the response variable remains unexplained by the model.

The right-skewed nature of the distribution, with a high frequency of very low \(R^2\) values and a gradual decline in frequency as \(R^2\) increases, suggests that it is much more common for the model to explain very little of the variance in the data. This further supports the notion that the model generally performs poorly in capturing the relationship between the variables.

Using the above, we can conclude that our model does not generate data similar to what was observed.

4 References